# dsl-image-simulation
Simulations for DSL image paper

## Effects of Errors

We run simulations based on the Won et al. data to show how errors in the surrogates and errors in the expert labels affect the properties of the DSL estimator.

We use the subset of observations from the Won et al. data classified as depicting a protest (`protest==1`). This subset contains 11659 observations of 11 variables, all constructed by Won et al. in their analysis of the image data. One variable, `violence`, is continuous on the unit interval; the remainder are binary.

For both sets of simulations, we fit the following linear model: `violence ~ sign + photo + fire + police + children + group_20 + flag + night + shouting`. We treat `fire` as the unmeasured variable proxied by an imperfect surrogate.  

The first set of simulations examines how the error rate in the surrogate and the number of expert-labeled observations together affect the estimator.

Our design grid consists of varying the number of expert labels (100, 250, 500, 750, 1000, 2000) and surrogate accuracy (0.5, 0.75, 0.9, 0.95, 0.99, 1.0). For each design, we run 100 iterations of the following simulation:

1. Resample the data with replacement: $\mathcal{D}_i$
2. Fit the oracle model using the ground truth values of the data and surrogate.
3. Create an imperfect surrogate by randomly corrupting `(1 - acc) * N` of the values, where `acc` is the surrogate accuracy and `N` is here the number of observations.
4. Fit the surrogate-only (SO) estimator and the DSL estimator.
5. Record the coefficients and standard errors for the oracle, SO and DSL estimates.
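
As a rough illustration of the scale of the first experiment, the design grid and iteration count described above can be enumerated as below. This is a minimal sketch, not the repository's code: the variable names (`N_EXPERT_LABELS`, `SURROGATE_ACCURACY`, `design_grid`) are hypothetical, and only the grid values and the 100 iterations come from the text.

```python
from itertools import product

# Values taken from the design grid described above.
N_EXPERT_LABELS = [100, 250, 500, 750, 1000, 2000]
SURROGATE_ACCURACY = [0.5, 0.75, 0.9, 0.95, 0.99, 1.0]
N_ITERATIONS = 100

# Every (number of expert labels, surrogate accuracy) pair is one design.
design_grid = [
    {"n_labels": n_labels, "accuracy": acc}
    for n_labels, acc in product(N_EXPERT_LABELS, SURROGATE_ACCURACY)
]

# Each design is simulated N_ITERATIONS times.
total_runs = len(design_grid) * N_ITERATIONS
print(len(design_grid), total_runs)  # 36 designs, 3600 simulation runs
```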

In step 3, we make the propensity of an error in the surrogate a function of a linear combination of the other covariates in the model. This reflects a realistic setting in which the probability that the surrogate model makes a mistake is related to features of the data itself.
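
One way such covariate-dependent corruption could be implemented is sketched below. Everything specific here is an assumption made for illustration rather than the repository's actual procedure: the function name `corrupt_surrogate`, the logistic transform of the linear combination, the example weights, the weighted sampling scheme (Efraimidis-Spirakis keys), and the toy data. The sketch flips exactly `round((1 - acc) * N)` binary values, so the realized accuracy matches the target while errors concentrate on rows with a higher propensity.

```python
import math
import random


def corrupt_surrogate(truth, covariates, weights, acc, rng):
    """Flip round((1 - acc) * N) binary labels, favouring high-propensity rows."""
    n = len(truth)
    n_flip = round((1 - acc) * n)
    # Assumed error propensity: logistic transform of a linear
    # combination of the covariates (hypothetical weights).
    propensity = [
        1 / (1 + math.exp(-sum(w * x for w, x in zip(weights, row))))
        for row in covariates
    ]
    # Weighted sampling without replacement: draw key u ** (1 / p) per row
    # and keep the n_flip largest keys.
    keys = [rng.random() ** (1 / p) for p in propensity]
    flip = sorted(range(n), key=lambda j: keys[j], reverse=True)[:n_flip]
    surrogate = list(truth)
    for j in flip:
        surrogate[j] = 1 - surrogate[j]  # corrupt the binary value
    return surrogate


# Toy data standing in for the real covariates and the true `fire` values.
rng = random.Random(0)
n = 1000
covariates = [[rng.randint(0, 1) for _ in range(3)] for _ in range(n)]
truth = [rng.randint(0, 1) for _ in range(n)]

surrogate = corrupt_surrogate(truth, covariates, [1.0, -0.5, 0.25], acc=0.9, rng=rng)
n_errors = sum(s != t for s, t in zip(surrogate, truth))
print(n_errors)  # 100 of the 1000 values are corrupted
```

Fixing the number of flips keeps the accuracy at the design value in every iteration; drawing each error independently from its propensity would instead make the realized accuracy vary from run to run.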

We then calculate the true parameter values as the average of the oracle model coefficients across all 100 iterations, and calculate our diagnosands as follows:

- Bias: the average absolute difference between the estimate and the true parameter value, $\frac{1}{N} \sum_{i=1}^N \lvert \beta_{DSL, i} - \beta^* \rvert$, where $i$ denotes the simulation iteration, $N$ denotes the number of simulations and $\beta^* = \frac{1}{N}\sum_{i=1}^N\beta_{oracle, i}$
- RMSE: the average root mean squared error of the estimates relative to the true parameter value
- Coverage: the proportion of simulations for which the true parameter value is in the confidence interval provided by the estimator
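
For a single coefficient, the three diagnosands above might be computed along the following lines. This is a hedged sketch: the function name `diagnosands` and the toy inputs are hypothetical, the RMSE uses the standard definition $\sqrt{\frac{1}{N}\sum_{i=1}^N (\beta_{DSL, i} - \beta^*)^2}$ as an assumption because the text does not spell it out, and the coverage check assumes a symmetric normal-approximation interval (estimate ± 1.96 standard errors) rather than whatever interval the estimator actually reports.

```python
import math
from statistics import fmean


def diagnosands(estimates, std_errors, oracle_estimates, z=1.96):
    """Bias, RMSE and coverage for one coefficient across iterations."""
    # True value: average of the oracle coefficients across iterations.
    beta_star = fmean(oracle_estimates)
    errors = [b - beta_star for b in estimates]
    # Bias as defined above: mean absolute deviation from beta_star.
    bias = fmean([abs(e) for e in errors])
    # Assumed RMSE definition: root of the mean squared deviation.
    rmse = math.sqrt(fmean([e * e for e in errors]))
    # Assumed interval: estimate +/- z * standard error.
    covered = [abs(e) <= z * se for e, se in zip(errors, std_errors)]
    coverage = sum(covered) / len(covered)
    return {"beta_star": beta_star, "bias": bias, "rmse": rmse, "coverage": coverage}


# Toy inputs for four iterations (made-up numbers, not simulation results).
result = diagnosands(
    estimates=[0.4, 0.6, 0.5, 0.8],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    oracle_estimates=[0.5, 0.5, 0.5, 0.5],
)
print(result)
```

In the toy input, three of the four intervals contain the true value of 0.5, so the coverage is 0.75.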

In the second set of simulations, we fix the surrogate accuracy at $0.75$ and introduce errors into the expert labels, varying the expert-label accuracy over (0.75, 0.8, 0.9, 0.95, 0.99, 1.0).


